Abstract

Author: Charles Tapley Hoyt

Estimated Run Time: 4 minutes

This notebook calculates the concordance between differential gene expression and the causal relationships between the corresponding protein products. This analysis assumes the forward hypothesis, under which gene expression is believed to be correlated with protein activity. While this assumption has many faults, it directly enables a very simple analysis of a knowledge assembly.

Notebook Imports


In [1]:
import logging
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from matplotlib_venn import venn2
import seaborn as sns

import pybel
import pybel_tools as pbt
from pybel_tools.analysis.concordance import *
from pybel.constants import *
from pybel.canonicalize import calculate_canonical_name
from pybel_tools.visualization import to_jupyter

In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

Notebook Provenance

The time of execution, random number generator seed, and the versions of the software packages used are displayed explicitly.


In [3]:
time.asctime()


Out[3]:
'Sun Aug 27 11:39:19 2017'

In [4]:
# seed the random number generator
import random
random.seed(127)

In [5]:
pybel.__version__


Out[5]:
'0.7.3-dev'

In [6]:
pbt.__version__


Out[6]:
'0.2.2-dev'

Local Path Definitions

To make this notebook interoperable across many machines, the locations of the repositories containing the data used in this notebook are referenced through environment variables, which are set in ~/.bashrc to point to where the repositories have been cloned. Assuming the repositories have been git clone'd into the ~/dev folder, the entries in ~/.bashrc should look like:

...
export BMS_BASE=~/dev/bms
...

BMS

The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/


In [7]:
bms_base = os.environ['BMS_BASE']

OwnCloud

The differential gene expression data used in this notebook are currently unpublished, so they are referenced through our team's internal OwnCloud data storage system.


In [8]:
owncloud_base = os.environ['OWNCLOUD_BASE']

Data

Alzheimer's Disease Knowledge Assembly

The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.

pybel convert --path "$BMS_BASE/aetionomy/alzheimers/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers/alzheimers.gpickle"

The BEL script can also be compiled from inside this notebook with the following python code:

>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)

In [9]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')

In [10]:
graph = pybel.from_pickle(pickle_path)

In [11]:
graph.version


Out[11]:
'4.0.3'

All orthologies are discarded before analysis.


In [12]:
pbt.filters.remove_nodes_by_namespace(graph, {'MGI', 'RGD'})

Following the forward hypothesis, the knowledge graph is collapsed to genes using pbt.mutation.collapse_by_central_dogma_to_genes.
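The idea behind the collapse can be illustrated with a minimal sketch. The node tuples and the collapse_to_gene helper below are hypothetical simplifications; the real pbt.mutation.collapse_by_central_dogma_to_genes operates on a full BELGraph and merges the collapsed nodes' edges.

```python
# Illustrative sketch only: protein, RNA, and miRNA nodes are mapped
# onto their reference gene node, while other node types are untouched.
def collapse_to_gene(node):
    """Map a protein, RNA, or miRNA node tuple onto its reference gene."""
    function, namespace, name = node
    if function in {'Protein', 'RNA', 'miRNA'}:
        return ('Gene', namespace, name)
    return node  # pathologies, processes, etc. are left as-is

collapse_to_gene(('Protein', 'HGNC', 'APP'))
```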


In [13]:
pbt.mutation.collapse_by_central_dogma_to_genes(graph)

An additional assumption is made about the activities of variants of proteins and genes. All variants are collapsed to the reference gene.


In [14]:
pbt.mutation.rewire_variants_to_genes(graph)

In [15]:
pbt.summary.print_summary(graph)


Nodes: 3701
Edges: 19562
Citations: 1760
Authors: 9305
Network density: 0.0014285401315933604
Components: 70
Average degree: 5.2855984868954335
Compilation warnings: 2578

In [16]:
pbt.summary.plot_summary(graph, plt, figsize=(10, 4))
plt.show()


Differential Gene Expression

Differential gene expression data can be obtained from many sources, including ADNI and other large clinical studies. This analysis is concerned with the log-fold-change of each gene, not necessarily the p-value. This keeps the process data-driven, because it requires neither a model nor multiple hypothesis testing on the raw data.
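For readers unfamiliar with the measure, the following sketch shows how a log-fold-change arises. The expression values are hypothetical and not taken from the data set used here.

```python
import numpy as np

# Hypothetical expression values for a single gene.
control = np.array([10.0, 12.0, 11.0])  # control samples
disease = np.array([22.0, 20.0, 24.0])  # disease samples

# log2 of the ratio of the means: positive values indicate
# up-regulation in disease, negative values down-regulation.
log_fc = np.log2(disease.mean() / control.mean())  # → 1.0
```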


In [17]:
data_path = os.path.join(owncloud_base, 'alzheimers', 'SevAD.csv')
target_columns = ['Gene.symbol', 'logFC']

In [18]:
df = pd.read_csv(data_path)
df = df.loc[df['Gene.symbol'].notnull(), target_columns]
df.head()


Out[18]:
Gene.symbol logFC
0 ZNF616 -4.244691
1 DEFB125 3.974393
3 SNAP23 3.337636
4 PHLDB2 3.192559
5 LOC389895 -4.296850

On the left, the number of shared elements in the knowledge assembly and the differential gene expression data set is counted.

On the right, a histogram of the log-fold-changes shows that the data are normally distributed, as expected for differential gene expression data.


In [19]:
data = {k: v for _, k, v in df.itertuples()}

In [20]:
hgnc_names = pbt.summary.get_names_by_namespace(graph, 'HGNC')
df_names = set(df['Gene.symbol'])
overlapping_hgnc_names = hgnc_names & df_names

In [21]:
fig, (venn_ax, hist_ax) = plt.subplots(1, 2, figsize=(10, 3))

venn_ax.set_title('Gene Overlap')
venn2([hgnc_names, df_names], set_labels=["Alzheimer's Disease KA", 'E-GEOD-5281'], ax=venn_ax)

hist_ax.set_title('Distribution of Log-Fold-Changes in E-GEOD-5281')
hist_ax.set_xlabel('Log-Fold-Change')
hist_ax.set_ylabel('Frequency')
hist_ax.hist(list(data.values()))

plt.tight_layout()
plt.show()


Analysis

Data Integration

Finally, the differential gene expression data are overlaid on the BEL graph with pbt.integration.overlay_type_data.
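The overlay step can be pictured with a toy sketch: each gene node receives its measured log-fold-change, and genes missing from the data set are imputed with a default value. The overlay_data helper and gene names below are hypothetical; the real pbt.integration.overlay_type_data annotates the nodes of a BELGraph in place.

```python
# Toy version of the overlay: build a {node: {key: value}} mapping,
# imputing a default for genes absent from the measured data.
def overlay_data(node_names, data, key='weight', impute=0):
    return {name: {key: data.get(name, impute)} for name in node_names}

weights = overlay_data(['APP', 'MAPT', 'PSEN1'], {'APP': 1.5, 'MAPT': -0.4})
# 'PSEN1' is not in the data, so it gets the imputed value 0
```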


In [22]:
key = 'weight'
cutoff = 0.3

In [23]:
pbt.integration.overlay_type_data(graph, data, key, GENE, 'HGNC', overwrite=False, impute=0)

In [24]:
graph.edge[graph.nodes()[25]].keys()


Out[24]:
dict_keys([('Pathology', 'DO', "Alzheimer's disease"), ('Gene', 'dbSNP', 'rs3757536')])

Concordance

The effect of the cutoff is explored when calculating the concordance of the full knowledge assembly.
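The role of the cutoff can be sketched as follows. The classify function is an illustrative stand-in, not the pybel_tools implementation: only genes whose absolute log-fold-change exceeds the cutoff carry a directional signal, so raising the cutoff moves genes (and hence edges) into the unassigned bin and changes the set of edges over which concordance is computed.

```python
# Illustrative sketch: bin a gene's log-fold-change relative to the cutoff.
def classify(log_fc, cutoff=0.3):
    """Label a gene up- or down-regulated if its absolute change
    exceeds the cutoff; otherwise it contributes no signal."""
    if log_fc > cutoff:
        return 'up'
    if log_fc < -cutoff:
        return 'down'
    return 'unassigned'

classify(1.5)               # → 'up'
classify(1.5, cutoff=2.0)   # → 'unassigned'
```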


In [25]:
cutoffs = np.linspace(0, 3, 50)

plt.plot(cutoffs, [
    calculate_concordance(graph, key, cutoff=cutoff)
    for cutoff in cutoffs
])

plt.title('Effect of Cutoff on Concordance \nfor {}'.format(graph))
plt.ylabel('Concordance')
plt.xlabel('Cutoff')
plt.show()


Stratify by Subgraph

The distribution of the concordance values across the stratified subgraphs is displayed below.


In [26]:
concordance_df = pd.DataFrame.from_dict(
    {
        value.replace(' subgraph', ''): calculate_concordance_helper(subgraph, key, cutoff)
        for value, subgraph in pbt.selection.get_subgraphs_by_annotation(graph, 'Subgraph').items()
    }, 
    orient='index'
)

concordance_df.columns = 'Correct', 'Incorrect', 'Ambiguous', 'Unassigned'

Subgraphs that contain at least one correct or one incorrect relation are shown, sorted by the number of correct relations.


In [27]:
has_correct = concordance_df['Correct'] > 0
has_incorrect = concordance_df['Incorrect'] > 0

concordance_df[has_correct | has_incorrect].sort_values('Correct', ascending=False).head(20)


Out[27]:
Correct Incorrect Ambiguous Unassigned
Interleukin signaling 20 41 25 285
Tumor necrosis factor 20 29 16 214
miRNA 10 16 80 167
Insulin signal transduction 10 19 17 761
Nuclear factor Kappa beta 8 29 19 60
Inflammatory response 8 17 23 488
Chemokine signaling 6 12 4 168
Neurotrophic 6 7 1 56
Nerve growth factor 6 10 5 171
Caspase 5 18 15 156
DKK1 5 5 8 77
Albumin 3 1 0 26
Acetylcholine signaling 3 3 2 361
Bcl-2 3 11 5 77
Complement system 3 3 3 97
Amyloidogenic 3 27 62 1324
Endosomal lysosomal 2 16 8 281
Chaperone 2 9 6 63
Notch signaling 2 2 8 117
Metabolism of steroid hormones 2 0 0 6

The 20 subgraphs with the highest concordance are shown. Even without statistical analysis, these data suggest that the highest concordance values are most likely due to random chance, since the top-scoring subgraphs have very few scoreable edges.


In [28]:
concordance_df['Concordance'] = concordance_df['Correct'] / (concordance_df['Correct'] + concordance_df['Incorrect'])

In [29]:
concordance_df.sort_values('Concordance', ascending=False).head(20)


Out[29]:
Correct Incorrect Ambiguous Unassigned Concordance
Plasminogen activator 1 0 2 97 1.000000
Metabolism of steroid hormones 2 0 0 6 1.000000
Cell cycle 1 0 7 83 1.000000
Vitamin 2 0 0 41 1.000000
Albumin 3 1 0 26 0.750000
Reactive oxygen species 2 1 4 184 0.666667
Interferon signaling 2 1 2 75 0.666667
Complement system 3 3 3 97 0.500000
Notch signaling 2 2 8 117 0.500000
DKK1 5 5 8 77 0.500000
Acetylcholine signaling 3 3 2 361 0.500000
Neurotrophic 6 7 1 56 0.461538
Tumor necrosis factor 20 29 16 214 0.408163
Binding and Uptake of Ligands by Scavenger Receptors 2 3 2 72 0.400000
miRNA 10 16 80 167 0.384615
Nerve growth factor 6 10 5 171 0.375000
Insulin signal transduction 10 19 17 761 0.344828
Prostaglandin 1 2 1 82 0.333333
Axonal transport 1 2 4 35 0.333333
Chemokine signaling 6 12 4 168 0.333333

While these calculations were presented in a DataFrame for exposition, they can be acquired directly with calculate_concordance_by_annotation.


In [30]:
%%time
results = calculate_concordance_by_annotation(graph, 'Subgraph', key, cutoff)


CPU times: user 427 ms, sys: 30.1 ms, total: 458 ms
Wall time: 466 ms

The distribution of concordance values over all subgraphs is shown.


In [31]:
sns.distplot([
    result
    for result in results.values() 
    if result != -1
])

plt.title('Distribution of concordance values at $C={}$'.format(cutoff))
plt.xlabel('Concordance')
plt.ylabel('Frequency')
plt.xlim([0, 1])
plt.show()


Conclusions

Varying the threshold from zero to increasingly stringent values yields varying results. High-dimensional data visualization techniques will be needed to characterize the effect of the threshold, and randomized permutation tests will eventually be needed to assess the stability of the concordance values and the reliability of this method.
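Such a permutation test might be sketched as follows. The concordance function, edges, and data here are toy stand-ins; a real test would shuffle the gene labels of the differential expression data and recompute concordance over the full knowledge assembly.

```python
import random

def concordance(edges, data):
    """Fraction of edges whose causal direction agrees with the data."""
    correct = incorrect = 0
    for source, target, relation in edges:
        product = data[source] * data[target]
        concordant = product > 0 if relation == 'increases' else product < 0
        if concordant:
            correct += 1
        else:
            incorrect += 1
    return correct / (correct + incorrect)

def permutation_p(edges, data, n_permutations=1000, seed=127):
    """Estimate how often randomly shuffled data scores at least as
    well as the observed concordance."""
    rng = random.Random(seed)
    observed = concordance(edges, data)
    values = list(data.values())
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(values)
        shuffled = dict(zip(data, values))
        if concordance(edges, shuffled) >= observed:
            hits += 1
    return hits / n_permutations

# Toy example with hypothetical genes and relations
edges = [('A', 'B', 'increases'), ('B', 'C', 'decreases')]
data = {'A': 1.0, 'B': 0.5, 'C': -0.7}
p = permutation_p(edges, data)
```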

Without further methodological development, this method seems unlikely to provide useful assistance in analysis.